Abstract

Hybridization between distantly related organisms can facilitate rapid adaptation to 
novel environments, but is potentially constrained by epistatic fitness interactions 
among cell components. The zoonotic pathogens Campylobacter coli and C. jejuni 
differ from each other by around 15% at the nucleotide level, corresponding to an average 
of nearly 40 amino acids per protein-coding gene. Using whole genome sequencing, 
we show that a single C. coli lineage, which has successfully colonized an agricultural 
niche, has been progressively accumulating C. jejuni DNA. Members of this lineage 
belong to two groups, the ST-828 and ST-1150 clonal complexes. The ST-1150 complex 
is less frequently isolated and has undergone a substantially greater amount of intro- 
gression leading to replacement of up to 23% of the C. coli core genome as well as 
import of novel DNA. By contrast, the more commonly isolated ST-828 complex bacte- 
ria have 10-11% introgressed DNA, and C. jejuni and nonagricultural C. coli lineages 
each have <2%. Thus, the C. coli that colonize agriculture, and consequently cause 
most human disease, have hybrid origin, but this cross-species exchange has so far not 
had a substantial impact on the gene pools of either C. jejuni or nonagricultural C. coli. 
These findings also indicate remarkable interchangeability of basic cellular machinery 
after a prolonged period of independent evolution. 
Introduction

Bacterial genomes show great flexibility in their genome 
size and composition and can acquire genes encoding 
entire metabolic pathways from other organisms (Lawrence 
1999; Ochman et al. 2000). Nevertheless, there are many 
species that are characterized by a large and stable 'core 
genome'. Although DNA within the core genome can 
be replaced in recombination events, this is almost 
always with homologous DNA from another member 
of the same species (Fraser et al. 2007). An important 
evolutionary question is what underlies this stability 
(Doolittle & Zhaxybayeva 2009). Is the core genome in 
each species a coadapted unit, such that equivalent 
genes taken from other species will not function prop- 
erly? Is each gene in the core genome adapted to the 
specific sets of environments that the species inhabits? 
Or is acquisition of DNA limited by mechanisms that 
prevent uptake of DNA from outside the species? 

Campylobacter are Gram-negative microaerophilic 
epsilon proteobacteria that inhabit the intestinal tracts 
of birds (Waldenstrom et al. 2002; Sheppard et al. 2010a) 
and other animals (Rosef et al. 1983) and have some 
capacity to survive in the nonenteric environment 
(Sopwith et al. 2008). They have relatively small 
genomes of approximately 1.6 megabases (Parkhill et al. 
2000), which limits the diversity of functions available 
to each organism. However, in common with many 
other bacteria, Campylobacter shows high levels of 
recombination (Suerbaum et al. 2001; Snipen et al. 2012), 
which might compensate for small genome size by 
providing each organism with the ability to import 
genes that confer adaptations to specific environments. 

Campylobacter jejuni and Campylobacter coli are among
the main causes of human gastroenteritis worldwide, 
largely because of infection of farm animals and trans- 
mission through the food chain to retail products 
(Sheppard et al. 2009). Both species are associated with 
several agricultural hosts (Sheppard et al. 2010a). 
C. jejuni is usually more abundant in cattle and
chickens, and C. coli dominates in pigs (Thakur et al. 2006).
C. jejuni has also been isolated from many wild bird 
species (Waldenstrom et al. 2007; Colles et al. 2008), but 
little is known about the distribution of C. coli among 
wild hosts. 

Campylobacter-like organisms were described by
Theodor Escherich in 1886 (Escherich 1886), and the 
genus was formally named in 1963 (Sebald & Veron 
1963). C. coli, identified as vibrio isolated from pig fae- 
ces (Doyle 1948), was designated as a species distinct 
from C. jejuni in 1973 (Veron & Chatelain 1973), and 
this species classification has been uncontroversial. 
However, by analysing the sequence of seven housekeeping
loci from a large number of strains, we found
evidence for the acquisition of substantial amounts of
C. jejuni DNA by one of the three C. coli clades (Shepp- 
ard et al. 2008, 2011). Both the correctness and implica- 
tions of this finding have been debated (Cohan & 
Koeppel 2008; Doolittle 2008; Caro-Quintero et al. 2009; 
Lefebure et al. 2010), and many questions remain about 
the patterns of introgression in the core and pan 
genome, and how it has influenced the evolution of 
these important pathogens. 

Here, in addition to four previously published 
genomes, we chose 26 Campylobacter isolates from clinical,
agricultural and nonagricultural sources (Tables S1
and S2, Supporting information), to encompass the
known diversity based on analysis of seven housekeep- 
ing loci (Sheppard et al. 2009). The genomes of the 
isolates were sequenced using the Illumina GA and 
Roche 454 platforms and assembled de novo. We per- 
formed model-based analysis of evolution of the core 
and pan genomes to reconstruct a history of between- 
species genetic exchange within the genus. 

Materials and methods 

Isolates and sequencing 

Isolates were chosen from multilocus sequence-typed 
collections to represent known diversity among C. jejuni 
and the three major C. coli clades, including nonagricul- 
tural strains for each lineage. These were cultured and 
genomic DNA was sequenced using Roche GS-FLX or 
Illumina Genome Analysers (see Appendix S1). Details
of all the isolates used, including four complete C. jejuni 
genomes (Parkhill et al. 2001; Fouts et al. 2005; Pearson 
et al. 2007) from the NCBI database (accession numbers: 
NC_009839; NC_008787; NC_003912; NC_002163), are
included in Tables S1 and S2 (Supporting information).

Genetic relationships between C. coli and C. jejuni 

A schematic diagram of the genomics analysis pipeline 
is given in Fig. S1 (Supporting information). The Bacterial
Isolate Genome Sequence Database (BIGSdb) (Jolley
& Maiden 2010) was used to store contiguous sequences 
and whole genome data from Genbank. Locus names 
and reference sequences were defined based upon the 
finished genome of isolate NCTC11168 (Cabello et al. 
1997; Parkhill et al. 2001; Gundogdu et al. 2007). The 
presence of a preliminary set of orthologs was defined
by identifying reciprocal best hits to NCTC11168 loci, with at
least 70% nucleotide identity and a minimum of 50%
alignment length, using the blast algorithm. The analysis
of orthology was made for every genome, and the core 
genome, consisting of genes ubiquitous among isolates 
of the genus, was defined. 
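
As an illustration only, the orthologue thresholds and the core genome definition can be sketched as follows. The sketch assumes that reciprocal best hits have already been computed and summarised per genome; the `Hit` record, the function names and the reading of the 50% criterion as a minimum aligned fraction of the reference locus are assumptions made for this example and are not taken from the pipeline used in the study.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One reciprocal best hit of a reference locus in a genome (hypothetical record)."""
    locus: str               # reference locus name, e.g. an NCTC11168 locus tag
    identity: float          # percent nucleotide identity of the hit
    aligned_fraction: float  # aligned length divided by reference locus length


def orthologues(hits: Iterable[Hit],
                min_identity: float = 70.0,
                min_aligned: float = 0.5) -> Set[str]:
    """Return the loci whose best hit passes both thresholds."""
    return {h.locus for h in hits
            if h.identity >= min_identity and h.aligned_fraction >= min_aligned}


def core_genome(hits_by_genome: Dict[str, Iterable[Hit]]) -> Set[str]:
    """Return the loci with a qualifying orthologue in every genome."""
    per_genome = [orthologues(hits) for hits in hits_by_genome.values()]
    return set.intersection(*per_genome) if per_genome else set()


# Toy input: two genomes, two reference loci (values are invented).
example = {
    "isolate_A": [Hit("Cj0001", 98.0, 1.0), Hit("Cj0002", 85.0, 0.9)],
    "isolate_B": [Hit("Cj0001", 88.0, 0.8), Hit("Cj0002", 65.0, 0.9)],
}
core = core_genome(example)  # only Cj0001 passes in both genomes
```

Applying the same function to a subset of genomes (for example, one clade) gives the loci common to that subset.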
Gene orthologs were aligned on a gene-by-gene basis
using muscle (Edgar 2004) and then concatenated into 
contiguous sequence for each isolate genome including 
gaps for missing nucleotides (or entire genes). A 
phylogeny of whole genome alignments (1.53 Mbp) was 
reconstructed using mega (Kumar et al. 2008) version 3.1 
with the Kimura 2-parameter model and neighbour- 
joining clustering. 

The ancestry of individual nucleotides was estimated 
using the model-based clustering algorithm imple- 
mented in the software structure (Falush et al. 2003). 
A file describing all the 239 543 nucleotide substitutions
and the position of the polymorphic sites was
constructed from the gene-by-gene alignment file, for
loci present in all the genomes. The structure analysis was run for
100 000 iterations following a 20 000 iteration burn-in.
Genes were ordered by the amount of introgression in 
the 828 complex, identified with structure. For the 13 
most introgressed genes, also present in the analysis of 
Lefebure et al. (2010), individual neighbour-joining trees 
were constructed with and without our strains (Fig. S8, 
Supporting information). 

Pairwise alignments of genomes were generated 
using progressive mauve version 2.3.1 (Darling et al. 
2004, 2010) with the default parameter settings and 
analysed using a Bayesian change point model as 
described previously (Didelot et al. 2007). This model 
assumes that the level of nucleotide divergence between 
the two genomes follows a stepwise constant function 
and uses a reversible-jump MCMC to reconstruct this 
function. Histograms were then built to show the distri- 
bution of the level of divergence along the genomes. 

Event-based analysis of C. coli clade 1 evolution

Multiple sequence alignment of the contigs for each 
genome was performed using progressive mauve 
version 2.3.1 (Darling et al. 2004, 2010) with the default 
parameter settings. The progressive mauve backbone 
output file was used to assign regions of each genome 
as either core ('backbone') segments, conserved among 
all of the genomes, or accessory ('variable') segments 
absent from at least one alignment. Briefly, the multiple 
genome alignment was automatically analysed to iden- 
tify conserved segments using a homology hidden 
Markov model (Treangen et al. 2009). Regions where 
the posterior probability of sequence homology was 
>90% using a model trained on 80% identity and tuned 
to the sequence composition of Campylobacter were 
considered to be homologous. Nonhomologous regions 
create alignment gaps, and those alignment gaps were 
used to delineate a 'backbone' of conserved segments 
among each pair of genomes by simply calling any 
region with >20 nucleotides inserted or deleted in one
genome as nonbackbone (indels >20 nt). Pairwise
backbone predictions were merged into multigenome 
backbone predictions using the previously described 
methods (Treangen et al. 2009). Using this technique, 
the amount of core and accessory genome was deter- 
mined for all the isolates and for C. jejuni and C. coli 
individually and the three C. coli clades separately and 
for subsets of isolates (Fig. S9, Supporting information). 
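
The indel rule used to delineate pairwise backbone can be illustrated with a short sketch. It covers only the final step (splitting a pairwise alignment at gap runs longer than 20 columns); the homology hidden Markov model and the merging of pairwise predictions into a multigenome backbone are not reproduced. The function name and the representation of the alignment as two gapped strings are assumptions of this example.

```python
from typing import List, Tuple


def backbone_segments(aln_a: str, aln_b: str,
                      max_indel: int = 20) -> List[Tuple[int, int]]:
    """Return half-open alignment-column intervals [start, end) kept as backbone.

    Any run of more than `max_indel` gap columns in one sequence is treated
    as nonbackbone; shorter gap runs stay inside the surrounding segment.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    segments: List[Tuple[int, int]] = []
    seg_start = 0
    col = 0
    while col < n:
        if aln_a[col] == "-":
            gapped = aln_a
        elif aln_b[col] == "-":
            gapped = aln_b
        else:
            col += 1
            continue
        # Extend over the whole run of gap characters in the gapped sequence.
        run_end = col
        while run_end < n and gapped[run_end] == "-":
            run_end += 1
        if run_end - col > max_indel:
            if col > seg_start:
                segments.append((seg_start, col))
            seg_start = run_end
        col = run_end
    if n > seg_start:
        segments.append((seg_start, n))
    return segments
```

With the default threshold, a 25-column gap splits the alignment into two backbone segments, whereas a 5-column gap does not.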

Additionally, a gene-by-gene alignment was extracted 
from BIGSdb (Jolley & Maiden 2010) for genes present in
all the C. coli clade 1 genomes. A genealogy for these 
alignments was estimated using clonalframe, a model- 
based approach to determining microevolution in 
bacteria (Didelot & Falush 2007). This programme 
differentiates mutation and recombination events on each 
branch of the tree based on the density of polymor- 
phisms. Clusters of polymorphisms are likely to have 
arisen from recombination and scattered ones from 
mutation. Run on the C. coli clade 1 alignment, clonalframe
estimated that recombination introduced polymorphism
at an average of 8% of affected sites. This
value is higher than the genetic diversity within C. coli 
clade 1 and thus corresponds to imports from the other 
clades and species, as well as back-recombination 
events replacing previously introgressed DNA. The 
programme was run with 50 000 burn-in iterations fol- 
lowed by 50 000 sampling iterations. The consensus tree 
represents combined data from three independent runs 
with 75% consensus required for inference of related- 
ness. For each branch on the clonalframe genealogy, a 
list of homologous recombination events was extracted. 
Recombination events were defined as sequences of 
length >50 bp with a probability of recombination 
> 75% over the length reaching 95% in at least one site. 
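
The event definition above can be expressed as a simple scan over per-site recombination probabilities for one branch. This is a sketch under stated assumptions: the input is taken to be a plain list of posterior probabilities, one per alignment site, and the function name and output format are hypothetical rather than part of clonalframe.

```python
from typing import List, Sequence, Tuple


def recombination_events(probs: Sequence[float],
                         min_len: int = 50,
                         run_threshold: float = 0.75,
                         peak_threshold: float = 0.95) -> List[Tuple[int, int]]:
    """Return half-open site intervals [start, end) called as recombination events.

    An event is a maximal run of sites with probability above `run_threshold`
    that is longer than `min_len` sites and reaches `peak_threshold` at
    least once.
    """
    values = list(probs)
    events: List[Tuple[int, int]] = []
    start = None
    # A trailing 0.0 acts as a sentinel that closes a run ending at the last site.
    for i, p in enumerate(values + [0.0]):
        if p > run_threshold:
            if start is None:
                start = i
        elif start is not None:
            run = values[start:i]
            if len(run) > min_len and max(run) >= peak_threshold:
                events.append((start, i))
            start = None
    return events
```

Runs that are long but never reach the peak threshold, or that reach it but are too short, are not reported.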

To investigate the acquisition and loss of nonhomolo- 
gous DNA, we used the model-based Bayesian method 
implemented in genoplast (Didelot et al. 2009). This 
model allows the rates of gain and loss of genetic 
elements to be investigated over time in individual lin- 
eages. A multiple alignment was produced for C. coli 
clade 1 genomes (excluding isolate 16) using progres- 
sive MAUVE (Darling et al. 2004, 2010), and the 
conserved orthologous segments and repeat elements 
defined the core genome. Large gaps ( > 500 bp) in the 
alignment, where one or more genomes contain a 
sequence absent in the other genomes, identify the 
position of imported DNA in the accessory genome. 
A binary matrix of presence/absence of genetic features
of length 50 bp was constructed using the bbFilter
script distributed with MAUVE, and genoplast was run with
default parameters using this matrix and the genealogy
inferred by clonalframe as input. 

The origin of homologous and nonhomologous 
recombination events in C. coli clade 1 genomes was 
determined using the blast algorithm (Altschul et al.
1990). A list of events identified with clonalframe and 
genoplast was extracted, and sequences were compared 
to a library database of all the C. jejuni and C. coli clade 
2 and 3 genomes. The origin of the events was assigned 
based on the similarity (S) to a library sequence. Specifi- 
cally, the E-value that measures the reliability of the S 
score was calculated for all blast matches (>70% identity,
50% alignment), and the event was inferred to have
originated in the species/clade containing homologous
sequence with the lowest E-value. 
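
A minimal sketch of this assignment rule is given below. It assumes that the blast matches for one event have already been parsed into records; the `BlastHit` fields, the function name and the treatment of the 50% alignment criterion as a minimum aligned fraction are assumptions of this example. Ties in E-value are broken by input order, which is also an arbitrary choice made here.

```python
from typing import Iterable, NamedTuple, Optional


class BlastHit(NamedTuple):
    """One blast match of an event sequence against the library (hypothetical record)."""
    source: str              # species or clade of the library genome
    identity: float          # percent nucleotide identity
    aligned_fraction: float  # fraction of the event sequence aligned
    evalue: float            # E-value of the match


def assign_origin(hits: Iterable[BlastHit],
                  min_identity: float = 70.0,
                  min_aligned: float = 0.5) -> Optional[str]:
    """Return the source of the qualifying match with the lowest E-value."""
    passing = [h for h in hits
               if h.identity > min_identity and h.aligned_fraction >= min_aligned]
    if not passing:
        return None  # origin left unassigned
    return min(passing, key=lambda h: h.evalue).source


# Toy matches for a single event (values are invented).
hits = [
    BlastHit("C. coli clade 2", 91.0, 0.95, 1e-40),
    BlastHit("C. jejuni", 99.0, 1.00, 1e-80),
    BlastHit("C. coli clade 3", 60.0, 1.00, 1e-90),  # fails the identity filter
]
origin = assign_origin(hits)
```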

In addition to determining the clonal genealogy and 
the origin of recombination, clonalframe analysis infor- 
mation was used to investigate the impact of homologous 
recombination with C. jejuni on sequence divergence in 
C. coli clade 1 using sequence variation at 51 ribosomal 
protein (rps) subunit loci (see Appendix S1).

Functional analysis of introgression 

The proportion of nucleotide substitutions that changed 
the amino acid sequence in homologous sequence was 
investigated for recombinant C. jejuni genes found in 
C. coli clade 1. Genome comparison was made between 
an example C. jejuni (isolate 4) and an unintrogressed 
C. coli (isolate 23). These strains shared 1081 genes, 
defined as homologous sequence with >70% identity 
over >50% of the gene alignment; 584 of these genes 
were involved in recombination in at least one C. coli 
clade 1 isolate and 497 were not. The number of non- 
synonymous differences (N), number of synonymous 
differences (S) and ratio of nonsynonymous to synony- 
mous mutations (dn/ds) were determined from 
gene-by-gene alignments of recombining and nonre- 
combining genes using mega software version 3.1 
(Kumar et al. 2008). If recombinant sequence is removed 
by the action of selection against divergent amino acid 
sequence, then there will be a greater than expected 
number of synonymous substitutions in recombined 
genes and a lower dn/ds ratio. 

To investigate the relationship between the rate of 
genetic import and the functional category of genes, the 
number of genes involved in homologous recombination 
was determined for each cluster of orthologous groups 
(COG) category. For each COG category, the number 
and total length (bp) of imports were determined. This 
allowed the determination of the rate of imports per 
nucleotide and the proportion of genes from each COG 
involved in recombination. Some genes are present in 
two or more COG categories and were counted once for 
each COG. A second analysis of the function of recom- 
bined genes was carried out by determining the genes 
that were found only in C. jejuni and C. coli clade 1 and 
were absent in unintrogressed C. coli. 
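
The per-category tallies can be sketched as follows. The example assumes that each gene carries a string of one-letter COG codes, its length and the number of inferred imports affecting it; the `Gene` record and the function name are hypothetical. The rate is computed here as the number of imports divided by the summed length of the genes in the category, which is one plausible reading of "imports per nucleotide" and should be treated as an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Gene(NamedTuple):
    """Per-gene summary used for the tally (hypothetical record)."""
    name: str
    cogs: str        # one letter per COG category, e.g. "EH"
    length_bp: int   # gene length in base pairs (assumed positive)
    n_imports: int   # inferred homologous imports affecting the gene


def cog_summary(genes: Iterable[Gene]) -> Dict[str, Dict[str, float]]:
    """Per COG category: imports per bp and fraction of genes recombined.

    A gene assigned to several categories is counted once in each of them.
    """
    totals = defaultdict(lambda: {"genes": 0, "recombined": 0,
                                  "length_bp": 0, "imports": 0})
    for gene in genes:
        for cog in set(gene.cogs):
            t = totals[cog]
            t["genes"] += 1
            t["recombined"] += 1 if gene.n_imports > 0 else 0
            t["length_bp"] += gene.length_bp
            t["imports"] += gene.n_imports
    return {
        cog: {
            "imports_per_bp": t["imports"] / t["length_bp"],
            "fraction_recombined": t["recombined"] / t["genes"],
        }
        for cog, t in totals.items()
    }
```

Because multi-category genes are counted in each of their categories, the per-category gene counts can sum to more than the number of genes analysed.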



Genome comparison was carried out to identify 
differences in the gene content of the C. jejuni and 
C. coli genomes. By organizing C. jejuni genes absent 
from unintrogressed C. coli into functional categories, 
groups of genes of related function were identified. 
A second comparative analysis identified genes that 
were found only in C. jejuni and introgressed C. coli 
genomes. 

Results 

Our initial analysis focussed on the core genome. The 
NCTC11168 isolate has a 1.6 Mb genome (Parkhill et al. 
2000), and 0.96 Mb was aligned in all our isolates using 
mauve. Based on locus designations for NCTC11168 
(1623 genes), there were 542 genes with orthologues in 
all the isolates (genes with at least 70% nucleotide iden- 
tity and a minimum of 50% alignment length). Under 
the same criteria, there were 819 genes (50%) common 
to all C. jejuni isolates and 928, 1084 and 1078 common 
to C. coli clades 1, 2 and 3, respectively. 

We first constructed a neighbour-joining tree based 
on average genetic distances amongst isolates (Fig. 1A). 
On the tree, C. jejuni, C. coli clade 2 and C. coli clade 3 
isolates each formed discrete clusters. However, isolates 
previously designated as C. coli clade 1 were found in 
three places on the tree. The ST-828 and ST-1150 clonal 
complexes, which account for the great majority of 
strains found in agriculture and human disease (Shepp- 
ard et al. 2010b), formed discrete clusters separate from 
the two environmental C. coli clade 1 isolates. 

Evidence for introgression 

Three lines of evidence show that the large genetic 
distances among C. coli clade 1 isolates, illustrated by 
the neighbour-joining tree (Fig. 1A), are a consequence of 
the import of DNA from C. jejuni rather than accumula- 
tion of mutations during a prolonged period of separate 
evolution. The first used the linkage model of structure 
(Falush et al. 2003) that reconstructs ancestral popula- 
tions from DNA polymorphism data. When run assum- 
ing two ancestral populations, the inferred ancestral 
sources corresponded to C. jejuni and C. coli. The 
human and agricultural C. jejuni isolates had between 
0.4% and 1.7% inferred C. coli ancestry, consistent with 
a low level of import. Excluding isolates from the 
ST-828 and ST-1150 clonal complexes, the C. coli isolates 
showed a comparable amount of inferred C. jejuni 
ancestry that ranged from 0.2% to 1.2%. The ST-828 and 
ST-1150 clonal complexes showed substantially more 
evidence for DNA import from C. jejuni ranging from 
9.7% to 11.2% and 20.4% to 22.5%, respectively, spread 
throughout the genome (Fig. 1B).
Fig. 1 Ancestry of Campylobacter jejuni and C. coli. (A) Neighbour-joining tree of 30 C. jejuni and C. coli genomes. Isolates belonging
to C. jejuni are shown in blue, and those belonging to C. coli clade 1 are indicated in red, clade 2 in yellow, and clade 3 in green. The
scale bar represents a genetic distance of 0.01. (B) Genetic ancestry of 239 543 polymorphic sites among C. jejuni and C. coli isolates
inferred using structure assuming 2 populations. Genomes are ordered according to isolate NCTC 11168 (Parkhill et al. 2000), and
each nucleotide is coloured according to genetic ancestry to C. jejuni (■) or C. coli (□). (C) Histograms of nucleotide divergence
between C. jejuni (isolate 4) and genomes from C. coli clade 1 (red), 2 (yellow), 3 (green) and C. jejuni (blue). Pairwise comparisons
between C. jejuni and un-recombined clade 1 (isolate 23 in the example), and clade 2 and 3 genomes (upper panel) show a unimodal
distribution with modes between 10 and 12%. Comparison of C. jejuni with C. coli clade 1 isolates from the ST-828 and ST-1150
complexes (middle panel) has a bimodal distribution with similar modes at 10-12% but with an earlier mode at <2%. The nucleotide
divergence of the earlier mode is similar to comparison between two C. jejuni genomes (bottom panel).



The second line of evidence for introgression into 
C. coli clade 1 is provided by pairwise comparison of 
nucleotide differences between genomes. In the absence 
of gene flow, isolates from the two species should have 
a unimodal distribution of divergence levels reflecting 
accumulation of mutations throughout the genome. This 
pattern was observed for comparisons between C. jejuni 
and unintrogressed C. coli isolates, with modes between 
10% and 12% (Fig. 1C). Comparisons with the ST-828 
and ST-1150 clonal complex isolates showed a bimodal 
distribution with similar modes at 10-12% but also 
earlier modes at <2%. The low nucleotide divergence 
was consistent with recent recombination with C. jejuni. 
Combined with evidence for shared polymorphism 
found using structure, these patterns of divergence are 
consistent with recent gene flow from C. jejuni. The 
imported DNA has greater nucleotide identity to the 
agricultural C. jejuni isolates than the environmental 
C. jejuni isolates (Fig. S2, Supporting information). 

The third line of evidence is provided by constructing 
maximum likelihood trees separately for loci according 
to whether they have any C. jejuni ancestry according 
to clonalframe. This analysis was performed for the 51 
ribosomal protein subunit (rps) loci in our genomes
(Jolley et al. 2012). When analysis was limited to genes
where there was no C. jejuni-like sequence, the clade 1
strains clustered together (Fig. 2), consistent with their 
shared common ancestry as was found previously for 
seven MLST loci (Sheppard et al. 2008). Furthermore, as
in the previous analysis, the branching pattern positioned
clade 2 as a sister taxon to clade 1. In contrast, on
the tree for rps loci that showed evidence of interspecies 
recombination, C. coli clade 1 isolates were scattered 
around the branches joining the two species in the tree 
(Fig. 2). This analysis suggests that C. coli clade 1 is a 
real clade and that the presence of clade 1 isolates on 
three different parts of the whole genome neighbour-joining
tree (Fig. 1A) is an artefact of the substantial effect
of interspecies recombination on genetic distances. 

Core and pan genome evolution 

Having established that up to 23% of the C. coli clade 1 
genome is of C. jejuni origin, we investigated evolution 
within the clade and the sequence of events responsible 
for introgression. An alignment of C. coli clade 1 
genomes was constructed (excluding one strain with 
low genome coverage) and a tree of clonal relationships 
was estimated using clonalframe (Didelot & Falush 
2007). The analysis showed that the ST-828 and ST-1150 
Fig. 2 The effect of interspecific recombination on tree clade structure. Maximum likelihood trees, based on the Tamura-Nei model,
of 30 Campylobacter jejuni and C. coli genomes are based on concatenated sequences of ribosomal protein (rps) subunit loci genes that
show (A) 35 genes with no evidence of homologous recombination and (B) 16 with evidence of recombination in at least one isolate
using clonalframe. Isolates belonging to C. jejuni are shown in blue, and those belonging to C. coli clade 1 are indicated in red,
clade 2 in yellow, and clade 3 in green. The trees are drawn to scale, with branch lengths measured in the number of substitutions
per site. The scale bar represents a genetic distance of 0.01. Trefoil clade structure is resolved in nonrecombining genes.



clonal complexes are more closely related to one 
another than to an environmental isolate (isolate 23) 
(Fig. 3A). Most of the C. jejuni DNA found in the 
ST-828 complex was also found in the ST-1150 complex 
(Fig. 1B and Fig. S3, Supporting information), implying
that this genetic material was imported by the common 
ancestor(s) of both complexes. Subsequent to the diver- 
gence of the two complexes, the ST-1150 complex has 
acquired substantially more C. jejuni DNA than the 
ST-828 complex although import is ongoing in both 
complexes. 

In addition to recombination of homologous 
sequence, our approach allows us to investigate the 
evolution of the pan genome, which occurs via acquisi- 
tion and loss of genes. A multiple genome alignment 
was constructed for C. coli clade 1 isolates using mauve 
(Darling et al. 2004), and genoplast (Didelot et al. 2009) 
was applied to the alignment blocks to identify those 
that were gained or lost on particular branches of the 
tree (Fig. 3C). The origin of imported DNA was 
inferred by blast comparison of the sequences to refer- 
ence genomes from C. jejuni and C. coli clade 2 and 3. 
An equivalent analysis of origin was performed for 
homologous imports (Fig. 3C). In total, 438 nonhomolo- 
gous and 2237 homologous recombination events were 
inferred, although for methodological reasons, both
homologous and nonhomologous events were only
identified reliably towards the tips of the tree. In both 
cases, approximately 50% of events were of C. jejuni 
origin although for homologous recombination, the pro- 
portion is higher for many of the short branches. This 
demonstrates that a high magnitude of introgression 
has occurred in both the core and pan genomes. 

Effects of introgression on the core genome 

The availability of unintrogressed C. coli genomes pro- 
vides insight into where introgression has occurred and 
its genomic effects. Data in a variety of bacterial species 
have suggested that recombination is homology depen- 
dent (Cohan 2002; Fraser et al. 2007). We found that 
recombination was rarer in areas of the genome where 
there was high divergence between C. jejuni and the 
unintrogressed C. coli (Fig. 4). However, the observed 
degree of homology dependence was several orders of 
magnitude weaker than in other species where 
mismatch repair mechanisms prevent the integration of 
most sequences that contain even small numbers of 
nucleotide differences (Cohan 2002; Fraser et al. 2007). 
Based on a conservative threshold for identifying inter- 
species recombination events, clonalframe analysis 
implied that 9 (95% credibility regions 5-16) times as 
Fig. 3 History of recombination in C. coli clade 1. (A) Genealogy inferred using clonalframe. (B) Recombination and mutation
events on each branch of the genealogy for an example 1046 bp gene (metC2). Crosses indicate substitutions by either mutation or
recombination. Recombination events inferred with high posterior probability are coloured according to whether the imported DNA
is more similar to genomes from Campylobacter jejuni (blue) or C. coli clades 2 and 3 (yellow). (C) Total number of inferred
homologous and nonhomologous imports on each branch from each species across the genome. Isolate 16 was excluded from this
analysis because of poor genome coverage.



many substitutions were introduced on average by 
interspecies recombination as by mutation or intraspe- 
cies recombination. Within the most divergent regions 
of the genome (approximately 20% nucleotide diver- 
gence), the rate of interspecies exchange is approxi- 
mately half the genome-wide average, but this still 
implies nucleotides are at least four times more likely 
to be changed by cross-species recombination than by 
new mutation or within-species recombination. This 
level of recombination would lead to progressive spe- 
cies convergence if maintained throughout the genome 
over time (Sheppard et al. 2008). 

Comparing an unintrogressed C. coli clade 1 isolate
(isolate 23) with a C. jejuni genome (isolate 27) shows 
that there are on average 38 nonsynonymous substitu- 
tions per gene. This is approximately an order of mag- 
nitude more than in pairwise comparisons within 
unintrogressed Campylobacter populations, which range 
from 1.4 for C. coli clade 2 to 5.5 for C. jejuni, compara- 
ble to 2.9 nonsynonymous substitutions per gene 
between human and chimpanzee, for example (The Chimpanzee
Sequencing and Analysis Consortium 2005). We found that introgression has
taken place at equivalent rates in regions of both high
and low nonsynonymous differentiation between the
species (Fig. S4, Supporting information) and that there 
are on average 9.8 protein coding differences per gene 
between an unintrogressed clade 1 isolate (strain 23) 
and a member of the ST-1150 complex (strain 25). There 
was also no evidence for large differences in the rate of 
introgression between broadly defined functional 
categories (Fig. S5, Supporting information). Thus, intro- 
gression has greatly increased overall genetic diversity 
across the genome in C. coli clade 1 and introduced 
Fig. 4 Homology dependence of recombination between
Campylobacter jejuni and C. coli. The distribution of divergence 
between C. jejuni (isolate 4) and unintrogressed C. coli clade 
1 (isolate 23) for the recombinant genes (white) and the nonre- 
combinant genes (black). Recombination is rarer in areas of the 
genome where there is high divergence between species but 
the effect is slight with a large overlap between the two distri- 
butions. Recombination occurs between genes at all levels of 
divergence. 



thousands of changes that have potential functional 
significance. 

Effects of introgression on the accessory genome 

Genome comparison identified differences in gene 
content of C. jejuni and C. coli. By organizing the 88 
C. jejuni genes absent from unintrogressed C. coli into
functional categories, large groups of genes of related 
function were identified (Table S3, Supporting informa- 
tion). Particularly notable are: (i) solute transporters of 
different families, which may reflect differential nutrient 
utilization in the two species; (ii) specific cytochromes c 
and their associated biogenesis proteins, which may be 
related to the use of specialized respiratory substrates; 
and (iii) TonB-dependent outer membrane proteins 
potentially involved in iron uptake. The list also 
includes genes in diverse functional categories includ- 
ing those involved in core cellular function. Detailed 
characterization of the functional consequences of these 
differences requires further investigation. 

Thirty-one genes were identified only in C. jejuni and 
introgressed C. coli based on blast similarity (Table 1). 
Several genes were from a region (Cj0480c-Cj0490) that 
has recently been shown to be involved in the transport 
and metabolism of L-fucose (Muraoka & Zhang 2011). 
Most sugars cannot be used as growth substrates for 
Campylobacter, due to lack of 6-phosphofructokinase 
(PFK) (Velayudhan & Kelly 2002), so these genes pre- 
sumably allow conversion of fucose to triose phosphate 
or another intermediate that bypasses PFK for entry 
into central metabolic pathways. Fucose is a major com- 
ponent of host glycoproteins, particularly intestinal 
mucin, and obtaining these genes from C. jejuni could 
allow C. coli to utilize fucose and provide an advantage 
in the gut (Stahl et al. 2011). The list of genes present 
only in C. jejuni and introgressed C. coli also includes 
key genes associated with flagella. Phylogenetic trees of 
the fucose-associated genes show some isolates have 
sequences with high nucleotide identity to those found 
in C. jejuni, but other sequences are entirely distinct, 
suggesting that the genes were present in the C. coli 
pan genome prior to introgression and have been lost 
by unintrogressed strains (Fig. S6, Supporting informa- 
tion). Intriguingly, several distinct genotypes exist among 
the C. jejuni-like sequences consistent either with rapid 
diversification or with introgression on multiple occa- 
sions, suggestive of recent selection at that locus (Falush 
2009). 

Discussion 

Extensive recent recombination between species has a 
number of consequences for the patterns of DNA 
sequence diversity observed. First, there should be 
regions of the genome where individuals from the two 
species have highly similar sequence, reflecting recent 
common ancestry of DNA that has been imported from 
one species to the other. Second, patterns of inheritance 
for some stretches of the genome will be inconsistent 
with the consensus species tree. Third, recombination 
may elevate diversity within introgressed populations 
by introducing substitutions that were fixed during 
species divergence. 

Here, we have sought to systematically investigate 
genome-wide patterns of relatedness between C. jejuni 
and C. coli by searching for each of these types of signal 
within homologous parts of the genome. First, in pair- 
wise comparison of sequence diversity between C. jejuni 
and different C. coli isolates, C. coli that dominate in 
agriculture (ST-828 and ST-1150 complexes) had large 
fractions of the genome with low divergence from 
C. jejuni, typical of that found between two C. jejuni 
isolates. Importantly, isolates from nonagricultural 
C. coli lineages did not have comparable low-divergence 
regions. Second, we used the linkage model of 
STRUCTURE to identify sections of the genome with ancestry in 
both species. Consistent with the pairwise sequence 
analysis, there was very little evidence for introgression 
amongst C. jejuni and nonagricultural C. coli isolates but 
a high degree of introgression in agricultural C. coli. 
Third, we observed the diversifying effect of recombina- 
tion directly by reconstructing clonal relationships and 
the specific imports that took place during the evolution 
of the agricultural C. coli clonal complexes, using 
ClonalFrame. We found large numbers of imports that 
introduced changes at approximately 12% of sites, con- 
sistent with an origin in C. jejuni. This finding also 
implies that introgression is ongoing in both the ST-828 
and ST-1150 clonal complexes. 
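
The first of these signals, stretches of unusually low divergence between a C. coli isolate and C. jejuni, can be illustrated with a minimal Python sketch. It assumes two pre-aligned sequences of equal length; the function names, the window size and the 3% threshold are illustrative assumptions and are not taken from the analysis reported here.

```python
def windowed_divergence(seq_a, seq_b, window=500, step=500):
    """Return (start, divergence) for each window of two aligned sequences.

    Divergence is the fraction of mismatching sites among positions where
    both sequences carry an unambiguous base (A, C, G or T).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        compared = 0
        mismatches = 0
        for a, b in zip(seq_a[start:start + window], seq_b[start:start + window]):
            if a in "ACGT" and b in "ACGT":
                compared += 1
                mismatches += a != b
        # Windows with no comparable sites get NaN rather than a misleading 0.
        divergence = mismatches / compared if compared else float("nan")
        results.append((start, divergence))
    return results


def low_divergence_windows(seq_a, seq_b, threshold=0.03, window=500, step=500):
    """Start positions of windows whose divergence is at or below threshold."""
    return [start
            for start, d in windowed_divergence(seq_a, seq_b, window, step)
            if d <= threshold]
```

Windows flagged in this way are only candidates for introgression; distinguishing imported DNA from conserved genes requires the additional lines of evidence described above.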

A scenario of Campylobacter evolution 

We have reconstructed an evolutionary scenario (Fig. 5) 
within which to interpret patterns of diversity in both 
core and pan genomes. C. coli split from C. jejuni and 
subsequently diversified into three clades. Bacteria from 
both species show evidence of recombination within 
species and (for C. coli) within clade, based on a version 
of the four-gamete test applied to the two most common 
MLST alleles in each clade population (Sheppard et al. 2010c). 
This has contributed to generating the 'star-like' phylogenies 
amongst the C. coli clade 2 and 3 isolates in our 
sample (Fig. 1). However, genetic exchange between 
bacteria from different species and lineages has been rare 
enough to permit their progressive divergence, which 
has reached approximately 12% between C. jejuni and 
C. coli and around 4% between the three C. coli lineages. 
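
For readers unfamiliar with the four-gamete test, the following Python sketch shows the classic form of the test for two biallelic sites. It is not the MLST allele-based variant of Sheppard et al. (2010c) referred to above, and the function name and input format are assumptions made for illustration. If all four allele combinations are observed across isolates, the pattern cannot be explained by a single mutation at each site on one tree, so recombination (or recurrent mutation) is implied.

```python
def four_gamete_test(site_i, site_j):
    """Return True if all four allele combinations occur at two biallelic sites.

    site_i and site_j hold the allele carried by each isolate at the two
    sites, in the same isolate order (for example "AACC" and "GTGT").
    """
    if len(site_i) != len(site_j):
        raise ValueError("both sites must be scored in the same isolates")
    if len(set(site_i)) != 2 or len(set(site_j)) != 2:
        raise ValueError("both sites must be biallelic")
    # Each distinct (allele_i, allele_j) pair is one observed 'gamete'.
    return len(set(zip(site_i, site_j))) == 4
```

For example, `four_gamete_test("AACC", "GTGT")` reports an incompatible pair of sites, whereas `four_gamete_test("AACC", "GGTT")` does not.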

More recently, within C. coli clade 1, a lineage arose 
that started to import substantial quantities of C. jejuni 
DNA. This lineage has given rise to two clonal com- 
plexes, the ST-828 and ST-1150 complexes. These two 
clonal complexes currently make up the great majority 
of typed agricultural isolates but a small proportion of 
nonagricultural ones, as shown by the epidemiologi- 
Table 1 Campylobacter jejuni genes with homologous sequence (70% BLAST similarity, >50% of the gene) present among introgressed C. coli clade 1 genomes and hypotheses about their potential function

| Gene | Product | Description and hypothesis |
| --- | --- | --- |
| Transport and metabolism of L-fucose | | |
| Cj0480c | Transcriptional regulator | This is divergently transcribed from the other genes in this unit and is likely to be regulating the rightward reading genes. Cj0480 is an IclR family regulator (Gundogdu et al. 2007) and could be acting as a repressor or activator inducing expression of catabolic and transport genes in response to L-fucose |
| Cj0481 (annotated as dapA) | Putative dihydropicolinate (DHP) synthase | DHP, also present in Bacillus spores, is an intermediate of a variant of the lysine biosynthesis pathway from aspartate that catalyses the condensation of aspartate semi-aldehyde with pyruvate to form DHP. Cj0480 may catalyse a related lyase reaction, as there is another gene Cj0806 that could be the 'real' dapA; it may be involved as a lyase in a step of fucose catabolism? |
| Cj0482/0483 (uxaA) | Putative altronate or D-galactarate (sugar) hydrolase | Could be a pseudogene because the N-terminus encoded in Cj0482 and the C-terminus in Cj0483 is separated by a stop codon. There are examples where such genes are expressed. Possible fucose hydrolase? |
| Cj0484 | Major facilitator superfamily transport protein | Probably, a substrate-proton symporter to import a substrate driven by the pmf. It has some similarity to phthalate (aromatic) family transporters (Gundogdu et al. 2007). However, it is not possible to say what the substrate is likely to be from sequence data |
| Cj0485 | Dehydrogenase/oxidoreductase, FabG family | This is possibly an alcohol dehydrogenase |
| Cj0486 | Probable L-fucose transporter | This is a sugar transporter of the major facilitator superfamily, with significant similarity to the L-fucose-proton symporter of E. coli and other bacteria (Gundogdu et al. 2007). This is essential for L-fucose utilization in some C. jejuni strains (Muraoka & Zhang 2011) |
| Cj0489/Cj0490 | Putative aldehyde dehydrogenase | Potentially involved in a step of fucose catabolism? |
| Zinc uptake system | | There may be a connection between the zinc uptake system genes in supplying zinc for the activity of the protease. A number of proteins contain the Cj1589 domain, so it is difficult to predict the function but there may be a zinc connection with Cj0263. |
| Cj0263 | Zinc transporter ZupT | |
| Cj0620 | Zinc-dependent protease | |
| Cj1589 | Zinc-dependent hydrolase, possibly a beta-lactamase or glyoxalase II | |
| Flagellin-associated genes | | Flagellin-associated proteins that could be involved in niche colonization. The presence of AcpP2 and AcpS could suggest O-linked glycosylation of flagellin proteins being important. |
| Cj1339 (FlaA) | Flagellin protein | |
| Cj1338 (FlaB) | Flagellin protein | |
| Cj0548 (FliD) | Hook-associated protein | |
| Cj1299 (AcpP2) | Acyl carrier protein for the O-linked glycosylation locus | |
| Cj1409 (AcpS) | Holo-acyl carrier protein synthase | |
| Miscellaneous | | |
| Cj0555 | Putative malonate (HOOC.CH2.COOH) transporter | This could be involved in growth on malonate, but this is an uncommon plant-derived carbon source. |
| Cj1297 | Putative component of the efflux system | Speculatively associated with antibiotic efflux. |
| Cj1365c | Secreted serine protease | Could be associated with breakdown of specific proteins for growth on amino acids |
| Cj1506c (CcaA) | Chemoreceptor for aspartate A | Chemotaxis towards aspartate, as facilitated by CcaA, is involved in the colonization of the intestinal tract (Hartley-Tassell et al. 2010). |
| Cj1051c (CjeI) | | A restriction modification enzyme |
| Cj1134 (htrB) | Lauroyl acyltransferase | Enzyme involved in the biosynthesis of LipidA. This will probably be an essential gene |
| Cj1414c (KpsC) | Part of the capsule locus | Probable capsule polysaccharide modification gene |
| Cj1187c (ArsB) | Arsenical efflux pump | Used for detoxification. Actual substrate cannot be predicted, but Cj1297 may also have a related broad detoxification function |
| Cj0308c (BioD) | Dethiobiotin synthase | Involved in synthesis of the cofactor Biotin. |

Fig. 5 A scenario for the evolution of Campylobacter jejuni and 
C. coli. These species diverged followed by the split of C. coli 
clades 1, 2 and 3. Recombination from C. jejuni to C. coli clade 
1 began at some point before R1, and subsequent clonal expansion 
of introgressed lineages (828 and 1150 clonal complexes) 
at R1 and R2 led to the dominance of hybrid lineages in 
agriculture, human disease and currently available isolate 
collections. Clade 2 (C2) and 3 (C3) and clade 1 (C1*) populations 
from wild bird and environmental reservoirs (e.g. represented 
by isolates 16 and 23) remained unintrogressed. The 
cross-sectional area and diameter of the lineage 'trunks' are 
based on the abundance of isolates in the PubMLST database 
and the length of trunks is arbitrarily defined. 



Changes in patterns of gene flow are substantially 
harder to demonstrate in taxa where divergence is less 
complete or where there are no unintrogressed ancestral 
clades preserved. Among bacteria in particular, diver- 
gence may be uneven across the genome because of 
'fragmented speciation' (Retchless & Lawrence 2010). In 
this model, gene flow ceases first in some parts of the 
genome, for example, in regions responsible for adap- 
tive divergence. This acts as a barrier to subsequent 
recombination, and progressive diversification occurs at 
other loci around the genome. This model is not 
directly applicable here because isolates found in all 
three C. coli clades had similar high divergence with 
C. jejuni across the genome (Fig. 1C), which implies that 
after speciation, there was an extended period of diver- 
gence with low levels of gene flow. Nevertheless, it is 
of interest to investigate whether the high rate of 
genetic differentiation between C. jejuni and C. coli has 
acted as a barrier to recent recombination within the 
two agricultural C. coli lineages. Recombination was 
half as frequent in regions with 20% nucleotide divergence 
as in regions with 10% (Fig. 4). However, even 
among divergent regions, the rate of recombination was 
sufficient to promote progressive species convergence at 
current levels of DNA exchange. 



cally defined sample shown in Table 2. The rate of 
acquisition of DNA has been substantially higher for 
the ST-1150 complex than for the ST-828 complex. 
Because of the great preponderance of agricultural 
strains in almost all sample collections, several of the 
C. coli clade 2 and 3 isolates that we obtained for 
genome sequencing come from agriculture (Table S1, 
Supporting information), but these isolates have not 
undergone substantial introgression. 

Table 2 Distribution of C. coli lineages among different sources*

| Clade | Lineage | Farm isolates | Farm STs | Clinical isolates | Clinical STs | Riparian isolates | Riparian STs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ST-828 complex | 915 | 216 | 481 | 86 | 0 | 0 |
| 1 | ST-1150 complex | 38 | 12 | 0 | 0 | 0 | 0 |
| 1 | Clade 1 other | 201 | 107 | 19 | 17 | 0 | 0 |
| 2 | | 0 | 0 | 0 | 0 | 37 | 31 |
| 3 | | 2 | 2 | 0 | 0 | 30 | 25 |

MLST typed farm isolates (from cattle, chicken, pig, sheep or turkey faeces or meat), riparian isolates (from duck, swan, pigeon and 
gull faeces and environmental water samples), and clinical isolates (from human blood and faeces) are from published studies and 
defined locations (Sheppard et al. 2010b). Most clade 1 isolates that are not part of the ST-828 or ST-1150 complexes nevertheless share 
alleles with them, suggesting recent common ancestry. Note that isolates sequenced for the current study were taken from a wider 
collection including additional riparian and wild bird isolates. 
*Sheppard et al. (2010b). 

Comparison with other studies of introgression in 
Campylobacter 

Our results contrast with those of a recent study (Lefebure 
et al. 2010) that analysed a larger number of genomes but 
failed to find evidence for substantial introgression 
(except in a single isolate). The disagreement reflects 
differences in sampling and methodology rather than 
biology. All but one of the C. coli isolates analysed by 
Lefebure et al. (2010) came from the ST-828 complex and 
were estimated to have <1% introgression on average. 
The single exception was a member 
of the ST-1150 complex, for which introgression was 
inferred at 9.6% of genes. We estimated 23% introgression 
for the same isolate and at least 8% introgression for 
the ST-828 complex isolates. Moreover, the strains from 
Lefebure et al. (2010) are intermingled with the strains 
analysed here in a neighbour-joining tree constructed using whole 
genome sequences (Fig. S7, Supporting information). 

The substantial underestimation of introgression in 
the previous study (Lefebure et al. 2010) reflects the 
absence of nonagricultural strains from their sample 
and the use of a single methodology that is sensitive to 
sampling. Specifically, Lefebure et al. (2010) looked for 
patterns of ancestry inconsistent with the species tree 
by using gene-by-gene phylogenies to identify loci 
where a minority fraction of C. coli isolates clustered 
closer to C. jejuni than to other C. coli. This approach 
systematically misses introgression shared by a majority 
of isolates within the ST-828 complex. Fig. S8 (Supporting 
information) shows neighbour-joining trees for the 
13 genes that showed the highest levels of introgression 
in our analysis (>70% in each case), constructed with 
and without the isolates from this study. For most of 
the genes, the C. coli isolates from the study of Lefebure 
et al. (2010) contain sequences similar to those found in 
C. jejuni, while in every case, the nonagricultural C. coli 
isolates in our sample harbour entirely distinct 
sequences. Even in the cases where a handful of the 
Lefebure et al. (2010) ST-828 complex isolates do have the 
distinct (and presumably ancestral) C. coli sequence, 
the method they employed did not detect introgression 
because these isolates make up only a minority of their 
isolates. As a result, they inferred no introgression for 
these 15 genes, which is clearly an incorrect conclusion 
based on visual inspection of the neighbour-joining 
trees constructed using the combined sample (Fig. S8, 
Supporting information). 

A further limitation of the method employed by 
Lefebure et al. (2010) is that because phylogenies were 
constructed only for entire genes, they were likely to 
miss imports of short gene fragments. The linkage 
model of STRUCTURE uses a hidden Markov model, which 
can detect any tract long enough to introduce polymorphisms 
at several sites that are characteristic of the 
other species, and we observed many such imports that 
were much shorter than entire genes (Fig. 3). 
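
The reason a hidden Markov model can recover short imported tracts can be illustrated with a toy two-state example in Python. This is not the linkage model of STRUCTURE, which is considerably richer; it is a minimal sketch in which each polymorphic site is scored 1 if the isolate carries the allele characteristic of the other species and 0 otherwise, and the Viterbi algorithm assigns each site to a 'native' (0) or 'imported' (1) state. All function names and parameter values are assumptions chosen for illustration.

```python
import math


def viterbi_import_tracts(obs, p_switch=0.01,
                          p_foreign_native=0.02, p_foreign_import=0.9):
    """Most likely state path (0 = native, 1 = imported) for a 0/1 site series."""
    if not obs:
        return []
    log = math.log
    # emit[state][observation]: log-probability of seeing 0 or 1 in each state.
    emit = [
        (log(1 - p_foreign_native), log(p_foreign_native)),
        (log(1 - p_foreign_import), log(p_foreign_import)),
    ]
    stay, switch = log(1 - p_switch), log(p_switch)
    score = [log(0.5) + emit[0][obs[0]], log(0.5) + emit[1][obs[0]]]
    back = []
    for o in obs[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            from_same = score[s] + stay
            from_other = score[1 - s] + switch
            if from_same >= from_other:
                best, prev = from_same, s
            else:
                best, prev = from_other, 1 - s
            new_score.append(best + emit[s][o])
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def tracts(path):
    """Convert a state path into (start, end) intervals of imported sites."""
    out, start = [], None
    for i, s in enumerate(path):
        if s == 1 and start is None:
            start = i
        elif s == 0 and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(path)))
    return out
```

With these assumed parameters, a run of five consecutive foreign-type sites is called as an import, whereas a single isolated foreign-type site is not, because the gain in emission likelihood does not outweigh the cost of switching state twice. A gene-level phylogeny has no equivalent way to localize such a tract within a gene.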

Ecological and evolutionary implications of 
introgression 

While our analyses demonstrate introgression after a 
substantial period of little gene flow, many questions 
remain about its causes and adaptive consequences. 
One possible explanation for the increased uptake of 
C. jejuni DNA by C. coli lineages that are more common 
in agriculture (Table 2) is enhanced physical opportunity 
for genetic exchange associated with cocolonization 
of agricultural hosts by the two species. Little is known 
about the host range of nonagricultural C. coli, which 
include clades 2 and 3 and unintrogressed clade 1 
isolates (Sheppard et al. 2010b), but it might be principally 
composed of reservoirs that are not colonized by 
C. jejuni. One challenge for a simple model in which 
recombination rates are regulated by physical proximity 
is the low rate at which agricultural C. jejuni have 
acquired C. coli DNA, although this might in part be 
explained by a higher frequency of C. jejuni in hosts in 
which they co-occur. 

Given the absence of unintrogressed C. coli in food 
animals, it is possible to speculate that introgression 
may have provided key adaptations for proliferation in 
the agricultural niche. Introgression has led to C. coli 
clade 1 having the largest pan genome (Fig. S9, 
Supporting information) and to the import of several 
genes involved in the transport and metabolism of 
L-fucose (Table 1), which has been shown to be impor- 
tant in colonizing hosts (Muraoka & Zhang 2011). There 
are other examples of pathogenic lineages that are proposed 
to have arisen after a rapid burst of genome-wide 
introgression. These include Salmonella Paratyphi A and 
Typhi (Didelot et al. 2007) and Vibrio vulnificus (Bisharat 
et al. 2005). However, in Campylobacter, the genetic distance 
between hybridizing lineages is far greater and is comparable 
to that between Escherichia coli and Salmonella 
(Ochman & Groisman 1994) or a human and a marmoset 
(Peng et al. 2009). 

Whatever its adaptive benefits, the interspecies 
recombination that has been observed in C. coli presents 
a challenge to some views of how bacterial evolution 
proceeds. It has been proposed that most changes in 
proteins occur by coevolution, with substitutions in one 
protein resulting in selection pressure for reciprocal 
changes in interacting partners (Fraser et al. 2002). 
Recombination between divergent species would be 
expected to disrupt large numbers of evolved interactions 
and, if the genetic distance was sufficiently great, 
would be likely to create hybrids, or 'hopeful monsters' 
(Mayr 1970), with little chance of evolutionary success. 
Here, the two species differ by an average of nearly 40 
amino acids per protein-coding gene. 

To surmount the deleterious effects of disrupting 
numerous epistatic interactions, there would need to be 
substantial fitness advantages. Novel or extreme environments 
provide a setting within which hopeful monsters 
can be generated and proliferate because of the 
absence of well-adapted organisms (Rieseberg et al. 
2003). There are many features that might make livestock 
such a habitat for bacteria that have evolved in 
wild hosts. Furthermore, hopeful bacterial monsters can 
repair some of the most harmful disruptions to interac- 
tions of adaptive genes by subsequent homologous 
recombination of their interaction partners. Sequencing 
of larger numbers of isolates will allow more detailed 
characterization of this ongoing adaptive process and 
further develop our understanding of bacterial gene 
networks. 